Stability Based Sparse LSI/PCA: Incorporating Feature Selection in LSI and PCA

Authors

  • Dimitrios Mavroeidis
  • Michalis Vazirgiannis
Abstract

The stability of sample-based algorithms is a concept commonly used for parameter tuning and validity assessment. In this paper we focus on two well-studied algorithms, LSI and PCA, and propose a feature selection process that provably guarantees the stability of their outputs. The feature selection process is performed such that the level of (statistical) accuracy of the LSI/PCA input matrices is adequate for computing meaningful (stable) eigenvectors. The feature selection process "sparsifies" LSI/PCA, resulting in the projection of the instances onto the eigenvectors of a principal submatrix of the original input matrix, thus producing sparse factor loadings that are linear combinations solely of the selected features. We use bootstrap confidence intervals to assess the statistical accuracy of the input sample matrices, and matrix perturbation theory to relate the statistical accuracy to the stability of the eigenvectors. Experiments on several UCI datasets empirically verify our approach.
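
The selection-then-projection idea described above can be sketched in a few lines. This is a toy illustration, not the paper's actual criterion: here a feature is kept when a bootstrap percentile interval for its sign-aligned loading on the leading eigenvector excludes zero, and the sparse component is then the leading eigenvector of the covariance principal submatrix over the selected features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one latent factor drives features 0-2; features 3-5 are noise.
n = 300
z = rng.normal(size=n)
X = np.column_stack(
    [z + 0.3 * rng.normal(size=n) for _ in range(3)]
    + [rng.normal(size=n) for _ in range(3)]
)

def leading_eigvec(C):
    """Eigenvector of the largest eigenvalue (eigh sorts ascending)."""
    return np.linalg.eigh(C)[1][:, -1]

v_full = leading_eigvec(np.cov(X, rowvar=False))

# Bootstrap the sample covariance; record sign-aligned loadings so the
# eigenvector sign ambiguity does not inflate the intervals.
B = 200
loadings = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    v = leading_eigvec(np.cov(X[idx], rowvar=False))
    loadings[b] = v * np.sign(v @ v_full)

# Keep features whose 95% bootstrap interval excludes zero.
lo, hi = np.percentile(loadings, [2.5, 97.5], axis=0)
selected = np.where((lo > 0) | (hi < 0))[0]

# "Sparse" PCA: eigenvector of the principal submatrix, so the
# projection is a linear combination of the selected features only.
v_sparse = leading_eigvec(np.cov(X[:, selected], rowvar=False))
scores = X[:, selected] @ v_sparse
print(selected)  # the signal features 0-2 should be among those kept
```

With this setup the three signal features obtain stable loadings near 1/√3, while the noise loadings fluctuate around zero and tend to be discarded.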

Similar resources

Prompting the data transformation activities for cluster analysis on collections of documents

In this work we argue for a new self-learning engine able to suggest good transformation methods and weighting schemas to the analyst for a given data collection. This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on an engine capable of exploring different data weighting schemas (e.g., normalized term frequencies, logarithmic entropy) and data transf...
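
The weighting schemas mentioned above (normalized term frequencies, logarithmic entropy) are standard in document clustering. A minimal sketch of log-entropy weighting, assuming the usual convention of a log-scaled local weight multiplied by a global entropy weight:

```python
import numpy as np

# Toy term-document counts: rows are terms, columns are documents.
tf = np.array([[2, 0, 3, 0],
               [1, 1, 1, 1],
               [0, 4, 0, 1]], dtype=float)
n_docs = tf.shape[1]

# Global entropy weight: a term spread evenly over all documents gets
# weight 0 (uninformative); a concentrated term gets weight near 1.
p = tf / tf.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)

# Log-entropy weight: local log scaling times the global weight.
W = g[:, None] * np.log1p(tf)
```

Row 1 is uniform across all four documents, so its entropy weight is exactly 0 and the term is effectively removed before any downstream analysis.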

Full text

Sparse Principal Component Analysis Incorporating Stability Selection

Principal component analysis (PCA) is a popular dimension reduction method that approximates a numerical data matrix by seeking principal components (PCs), i.e. linear combinations of variables that capture maximal variance. Since each PC is a linear combination of all variables of a data set, interpretation of the PCs can be difficult, especially in high-dimensional data. In order to find 'spa...

Full text

Information retrieval in hydrochemical data using the latent semantic indexing approach

Petr Praus (corresponding author) Department of Analytical Chemistry and Material Testing, VSB-Technical University Ostrava, 17 listopadu 15, 708 33 Ostrava, Czech Republic Tel.:+420 59 732 3370 Fax: 420 59 732 3370 E-mail: [email protected] Pavel Praks Department of Mathematics and Descriptive Geometry, Department of Applied Mathematics, VSB-Technical University Ostrava, 17 listopadu 15, 708 3...

Full text

Clustering and Feature Selection using Sparse Principal Component Analysis

In this paper, we use sparse principal component analysis (PCA) to solve clustering and feature selection problems. Sparse PCA seeks sparse factors, or linear combinations of the data variables, explaining a maximum amount of variance in the data while having only a limited number of nonzero coefficients. PCA is often used as a simple clustering technique and sparse factors allow us here to int...
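
Such sparse factors can be computed in several ways; one simple route (not necessarily the one used in the paper above) is the truncated power method, which alternates a power-iteration step with hard truncation to the k largest loadings. A toy sketch on block-correlated data, where the recovered support identifies one group of related features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent feature blocks, each driven by its own latent factor.
n = 300
z1, z2 = rng.normal(size=(2, n))
X = np.column_stack(
    [z1 + 0.2 * rng.normal(size=n) for _ in range(3)]
    + [z2 + 0.2 * rng.normal(size=n) for _ in range(3)]
)
C = np.cov(X, rowvar=False)

def truncated_power(C, k, iters=100):
    """Leading eigenvector constrained to at most k nonzero loadings."""
    v = np.ones(C.shape[0]) / np.sqrt(C.shape[0])
    for _ in range(iters):
        v = C @ v
        mask = np.zeros_like(v)
        mask[np.argsort(np.abs(v))[-k:]] = 1.0  # keep k largest entries
        v *= mask
        v /= np.linalg.norm(v)
    return v

v = truncated_power(C, k=3)
print(np.nonzero(v)[0])  # the support falls inside one feature block
```

Because the nonzero pattern singles out one correlated block, the sparse factor doubles as a feature-selection and clustering device, which is the interpretive advantage the abstract describes.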

Full text

Convex Principal Feature Selection

A popular approach for dimensionality reduction and data analysis is principal component analysis (PCA). A limiting factor of PCA is that it does not indicate which of the original features are important. There has been recent interest in sparse PCA (SPCA). By applying an L1 regularizer to PCA, a sparse transformation is achieved. However, true feature selection may not be achieved as non-spa...

Full text


Journal title:

Volume   Issue

Pages  -

Publication date: 2007